
use utf-8 codec if ensure_ascii is False to avoid UnicodeError #5

Open · wants to merge 2 commits into master
Conversation

neuralhax

When using the latest Cortex version (currently 3.0.0-RC3), which uses cortexutils 2.0.0, I noticed that the MaxMind_GeoIP_3_0 analyzer always fails with an "Invalid IP address" error, e.g.:

{
  "errorMessage": "Invalid IP address",
  "input": "{\"pap\":2,\"tlp\":2,\"parameters\":{},\"dataType\":\"ip\",\"data\":\"1.1.1.1\",\"message\":\"\",\"config\":{\"check_pap\":true,\"check_tlp\":true,\"proxy_https\":null,\"jobCache\":10,\"max_tlp\":2,\"auto_extract_artifacts\":false,\"cacerts\":null,\"jobTimeout\":30,\"proxy_http\":null,\"max_pap\":2}}",
  "success": false
}

The root cause of that error is the following exception:

Traceback (most recent call last):
  File "/opt/Cortex-Analyzers/analyzers/MaxMind/geo.py", line 88, in run
    'traits': self.dump_traits(city.traits)
  File "/usr/local/lib/python2.7/dist-packages/cortexutils/analyzer.py", line 106, in report
    }, ensure_ascii)
  File "/usr/local/lib/python2.7/dist-packages/cortexutils/worker.py", line 178, in report
    self.__write_output(output, ensure_ascii=ensure_ascii)
  File "/usr/local/lib/python2.7/dist-packages/cortexutils/worker.py", line 123, in __write_output
    json.dump(data, f_output, ensure_ascii=ensure_ascii)
  File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-7: ordinal not in range(128)

The latest Cortex 2.x version, which uses cortexutils 1.3.0, doesn't have this problem. The difference is that in cortexutils 2.0.0 the report is written into a file instead of standard output. Standard output is set to use the utf-8 encoding in the __set_encoding() function, but the same is not done when writing into the file. Python 2 uses the ascii codec by default, so when json.dump() is called with ensure_ascii=False (as in this case) and the data contains non-ASCII characters, it raises a UnicodeError. This behaviour is also described in the Python json.dump() documentation (a minimal reproducer is sketched after the quoted docstring below):

If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only.  If ``ensure_ascii`` is
false, some chunks written to ``fp`` may be ``unicode`` instances.
This usually happens because the input contains unicode strings or the
``encoding`` parameter is used. Unless ``fp.write()`` explicitly
understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
cause an error.
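
To illustrate, here is a minimal Python 2 reproducer of that failure. The data and file path are made-up examples, not cortexutils code:

# -*- coding: utf-8 -*-
# Python 2: json.dump() with ensure_ascii=False emits unicode chunks;
# writing them to a plain byte-mode file falls back to the ASCII codec
# and raises UnicodeEncodeError, as in the traceback above.
import json

data = {u'city': u'Zürich'}            # non-ASCII unicode code points

with open('/tmp/output.json', 'w') as f_output:
    json.dump(data, f_output, ensure_ascii=False)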

Unless I'm missing something, I believe this can be fixed by using the utf-8 encoding for the output file, in the same way as is done in the __set_encoding() function for sys.stdout and sys.stderr. With this patch, I no longer observe the reported error.
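
For reference, this is roughly what that looks like on Python 2, wrapping the output file with a UTF-8 writer the same way __set_encoding() wraps sys.stdout/sys.stderr (illustrative sketch only, not the literal patch):

# -*- coding: utf-8 -*-
import codecs
import json

data = {u'city': u'Zürich'}

with open('/tmp/output.json', 'w') as raw:
    # The StreamWriter encodes unicode chunks as UTF-8 before writing,
    # so json.dump(..., ensure_ascii=False) no longer hits the ASCII codec.
    f_output = codecs.getwriter('utf-8')(raw)
    json.dump(data, f_output, ensure_ascii=False)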

@DarkZatarra

I can confirm the bug; I had the same problem and it's fixed now.

@neuralhax (Author)

Unless I'm missing something, ...

Well, I have been missing something indeed. The original patch fixed the problem with the MaxMind analyzer, but introduced problems with other analyzers, e.g. AbuseIPDB, where I started to observe the following error:
"No content to map due to end-of-input\n at [Source: (sun.nio.ch.ChannelInputStream); line: 1, column: 0]"

The difference between these two analyzers is that MaxMind uses Python 2 and AbuseIPDB uses Python 3. In Python 2, text can be stored either as the str type or as the unicode type. In Python 3, str replaced unicode and bytes was introduced to replace Python 2's str. Non-ASCII characters in a Python 2 str are by default already encoded in UTF-8, whereas in unicode they are code points that are not encoded by default. Another difference is that in Python 2 the default encoding for files is ASCII, while in Python 3 it is UTF-8.

When json.dump() is called and a string contains non-ASCII Unicode code points, it fails with a UnicodeEncodeError because it tries to encode those characters with the ASCII codec; when the file writer uses the UTF-8 codec instead, it succeeds. However, if a string already contains UTF-8 encoded bytes, the UTF-8 writer fails with a UnicodeDecodeError, because it assumes ASCII input and cannot decode those bytes. Patch f109746 should cover both cases (unless I'm missing something again). It will still fail if the written data contains a mix of UTF-8 encoded bytes and Unicode code points; I hope that stays a theoretical problem only...
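
For the record, here is a rough Python 2/3 sketch of the idea described above. It is not the actual content of f109746, and write_report()/_to_unicode() are made-up names (the real cortexutils method is __write_output()):

# -*- coding: utf-8 -*-
# Sketch only: normalise Python 2 byte strings to unicode before dumping,
# then write the report through a UTF-8 writer on Python 2; on Python 3 an
# explicit UTF-8 text file is enough.
import codecs
import json
import sys

def _to_unicode(obj):
    # Byte strings in the report are assumed to already be UTF-8 encoded,
    # so decode them up front; json.dump() then only ever sees unicode.
    if isinstance(obj, str):
        return obj.decode('utf-8')
    if isinstance(obj, dict):
        return dict((_to_unicode(k), _to_unicode(v)) for k, v in obj.items())
    if isinstance(obj, list):
        return [_to_unicode(v) for v in obj]
    return obj

def write_report(data, path, ensure_ascii=False):
    if sys.version_info[0] == 2:
        with open(path, 'w') as raw:
            # Encode unicode chunks as UTF-8 instead of the default ASCII codec.
            f_output = codecs.getwriter('utf-8')(raw)
            json.dump(_to_unicode(data), f_output, ensure_ascii=ensure_ascii)
    else:
        # Python 3 str is already text; just be explicit about the encoding.
        with open(path, 'w', encoding='utf-8') as f_output:
            json.dump(data, f_output, ensure_ascii=ensure_ascii)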

@DarkZatarra

@xlaruen you are right. I was able to reproduce the AbuseIPDB part as well; I've applied your commit and it works, for now.
